Abstract

Convolutional neural networks (CNN) are increasingly used in many areas of
computer vision. They are particularly attractive because of their ability to “absorb”
great quantities of labeled data through millions of parameters. However,
as model sizes increase, so do the storage and memory requirements of the classifiers.
We present a novel network architecture, Frequency-Sensitive Hashed Nets
(FreshNets), which exploits inherent redundancy in both convolutional layers and 
fully-connected layers of a deep learning model, leading to dramatic savings in 
memory and storage consumption. Based on the key observation that the weights 
of learned convolutional filters are typically smooth and low-frequency, we first 
convert filter weights to the frequency domain with a discrete cosine transform 
(DCT) and use a low-cost hash function to randomly group frequency parameters 
into hash buckets. All parameters assigned the same hash bucket share a single 
value learned with standard back-propagation. To further reduce model size we 
allocate fewer hash buckets to high-frequency components, which are generally 
less important. We evaluate FreshNets on eight data sets, and show that it leads to 
drastically better compressed performance than several relevant baselines. 


1 Introduction 

In recent years convolutional neural networks (CNN) have led to impressive results in object
recognition [?], face verification [24] and audio classification [?]. Problems that seemed impossibly
hard only five years ago can now be solved at better than human accuracy [?]. Although CNNs
have been known for a quarter of a century [?], only recently have their superb generalization abilities
been accepted widely across the machine learning and computer vision communities. This broad
acceptance coincides with the release of very large collections of labeled data [?]. Deep networks
and CNNs are particularly well suited to learn from large quantities of data, in part because they can
have arbitrarily many parameters. As data sets grow, so do model sizes. In 2012, the first winner of
the ImageNet competition that used a CNN already had 240MB of parameters, and the most recent
winning model, in 2014, required 567MB [26].

Independently, there has been a parallel shift of computing from servers and workstations to
mobile platforms. As of January 2014 there have already been more web searches through smart
phones than computers.¹ Today speech recognition is primarily used on cell phones with intelligent
assistants such as Apple’s Siri, Google Now or Microsoft’s Cortana. As this trend continues, we
expect machine learning applications to also shift increasingly towards mobile devices. However,
the disjunction of deep learning with ever increasing model sizes and mobile computing reveals
an inherent dilemma. Mobile devices have tight memory and storage limitations. For example,
even the most recent iPhone 6 only features 1GB of RAM, most of which must be used by the
operating system or the application itself. In addition, developers must make their apps compatible
with the most limited phone still in circulation, often restricting models to just a few megabytes of
parameters.

¹ http://tinyurl.com/omd58sq

In response, there has been recent interest in reducing the model sizes of deep networks. Denil
et al. [?] use low-rank decomposition of the weight matrices to reduce the effective number of
parameters in the network. Bucilu et al. [?] and Ba et al. [?] show that complex models can be
compressed into 1-layer neural networks. Independently, the model size of neural networks can be
reduced effectively through reduced bit precision [?].

In this paper we propose a novel approach for neural network compression targeted especially for 
CNNs. We build on recent work by Chen et al. [?], who show that weights of fully connected
networks can be effectively compressed with the hashing trick [30]. Due to the nature of local
pixel correlation in images (i.e. spatial locality), filters in CNNs tend to be smooth. We transform
these filters into the frequency domain with the discrete cosine transform (DCT) [22]. In frequency
space, the filters are naturally dominated by low frequency components. Our compression takes this 
smoothness property into account and randomly hashes the frequency components of all CNN filters 
at a given layer into one common set of hash buckets. All components inside one hash bucket share 
the same value. As lower frequency components are more pronounced than higher frequencies, 
we allow collisions only between similar frequencies and allocate fewer hash buckets for the high 
frequencies (which are less important). 

Our approach has several compelling properties: 1. The number of parameters in the CNN is independent
of the number of convolutional filters; 2. During testing we only need to add a low-cost hash
function and the inverse DCT transformation to any existing CNN code for filter reconstruction; 3.
During training, the hashed weights can be learned with simple back-propagation [?]: the gradient
of a hash bucket value is the sum of gradients of all hashed frequency components in that bucket.

We evaluate our compression scheme on eight deep learning image benchmark data sets and compare 
against four competitive baselines. Although all compression schemes lead to lower test accuracy as 
the compression increases, our FreshNets method is by far the most effective compression method 
and yields the lowest generalization error rates on almost all classification tasks. 

2 Background 

Feature Hashing (a.k.a. the hashing trick) [8, 25, 30] has been previously studied as a technique for
reducing model storage size. In general, it can be regarded as a dimensionality reduction method that
maps an input vector $x \in \mathbb{R}^d$ to a much smaller feature space via a mapping
$\phi : \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$. The mapping $\phi$ is a composite of two
approximately uniform auxiliary hash functions $h : \mathbb{N} \to \{1, \dots, k\}$ and
$\xi : \mathbb{N} \to \{-1, +1\}$. The $j$-th element of the $k$-dimensional hashed input is
defined as

$$\phi_j(x) = \sum_{i : h(i) = j} \xi(i)\, x_i.$$

As shown in [30], a key property of feature hashing is its preservation of inner product operations,
where inner products after hashing produce the correct pre-hash inner product in expectation:

$$\mathbb{E}\left[\phi(x)^\top \phi(y)\right]_{\phi} = x^\top y.$$

This property holds because of the bias correcting sign factor $\xi(i)$. With feature hashing, models
are directly learned in the much smaller space, which not only speeds up training and evaluation
but also significantly conserves memory. For example, a linear classifier in the original space could
occupy $O(d)$ memory for model parameters, but when learned in the hashed space only requires
$O(k)$ parameters. The information loss induced by hash collision is much less severe for sparse
feature vectors and can be counteracted through multiple hashing [25] or larger hash tables [30].
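The definition of the hashed input can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the helper `_hash_int` and the use of MD5 as the underlying hash are hypothetical choices made for the example (the text only requires two approximately uniform hash functions), and buckets are 0-based here rather than 1-based.

```python
import hashlib
import numpy as np

def _hash_int(i: int, seed: str) -> int:
    """Deterministic integer hash of index i (MD5 is an illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{i}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def hash_features(x: np.ndarray, k: int) -> np.ndarray:
    """Map x in R^d to phi(x) in R^k with phi_j = sum_{i: h(i)=j} xi(i) * x_i."""
    phi = np.zeros(k)
    for i, x_i in enumerate(x):
        j = _hash_int(i, "h") % k                              # bucket h(i), 0-based
        sign = 1.0 if _hash_int(i, "xi") % 2 == 0 else -1.0    # sign factor xi(i)
        phi[j] += sign * x_i
    return phi

rng = np.random.default_rng(0)
x, y = rng.normal(size=1000), rng.normal(size=1000)
exact = float(x @ y)
approx = float(hash_features(x, 256) @ hash_features(y, 256))
print(exact, approx)  # approx is one noisy draw; equality holds only in expectation
```

A single pair of hash functions gives a noisy estimate of the inner product; the equality in the text is a statement about the expectation over hash functions.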

Discrete Cosine Transform (DCT) [22]. Methods built on the DCT are widely used for compressing
images and movies, including forming the standard technique for JPEG [?]. DCT expresses a
function as a weighted combination of sinusoids of different phases/frequencies where the weight of
each sinusoid reflects the magnitude of the corresponding frequency in the input. When employed
with sufficient numerical precision and without quantization or other compression operations, the
DCT and inverse DCT (projecting frequency inputs back to the spatial domain) are lossless. Compression
is made possible in images by local smoothness of pixels (e.g. a blue sky) which can be
well represented regionally by fewer non-zero frequency components. Though highly related to the
discrete Fourier transformation (DFT), DCT is often preferable for compression tasks because of
its spectral compaction property where weights for most images tend to be concentrated in a few
low-frequency components of the DCT [22]. Further, the DCT transformation yields a real-valued
representation, unlike the DFT whose representation has imaginary components. Given an input
matrix $V \in \mathbb{R}^{d \times d}$, the corresponding matrix $\mathcal{V} \in \mathbb{R}^{d \times d}$
in the frequency domain after DCT is defined as:

$$\mathcal{V}_{j_1 j_2} = s_{j_1} s_{j_2} \sum_{i_1=0}^{d-1} \sum_{i_2=0}^{d-1} c(i_1, i_2, j_1, j_2)\, V_{i_1 i_2}, \tag{1}$$

where

$$c(i_1, i_2, j_1, j_2) = \cos\left(\frac{\pi (2 i_1 + 1) j_1}{2d}\right) \cos\left(\frac{\pi (2 i_2 + 1) j_2}{2d}\right)$$

is the cosine basis function, and $s_j = \sqrt{1/d}$ when $j = 0$ and $s_j = \sqrt{2/d}$ otherwise. We use the shorthand
$f_{dct}$ to denote the DCT operation in Eq. (1), i.e. $\mathcal{V} = f_{dct}(V)$. The inverse DCT converts $\mathcal{V}$ from
the frequency domain back to the spatial domain, reconstructing $V$ without loss:

$$V_{i_1 i_2} = \sum_{j_1=0}^{d-1} \sum_{j_2=0}^{d-1} s_{j_1} s_{j_2}\, c(i_1, i_2, j_1, j_2)\, \mathcal{V}_{j_1 j_2}. \tag{2}$$

We denote the inverse DCT function in Eq. (2) as $f^{-1}_{dct}$, i.e. $V = f^{-1}_{dct}(\mathcal{V})$.
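As a sanity check on the lossless round trip, the 2-D DCT and its inverse can be written as matrix products with an orthonormal DCT-II basis matrix. The sketch below assumes that standard orthonormal normalization; the function names `dct_matrix`, `f_dct` and `f_dct_inv` are illustrative, not taken from any particular library.

```python
import numpy as np

def dct_matrix(d: int) -> np.ndarray:
    """C[j, i] = s_j * cos(pi * (2i + 1) * j / (2d)): orthonormal DCT-II basis."""
    i = np.arange(d)
    j = np.arange(d)[:, None]
    C = np.cos(np.pi * (2 * i + 1) * j / (2 * d))
    s = np.full(d, np.sqrt(2.0 / d))
    s[0] = np.sqrt(1.0 / d)
    return s[:, None] * C

def f_dct(V: np.ndarray) -> np.ndarray:
    """Forward 2-D DCT of a d x d matrix: C V C^T."""
    C = dct_matrix(V.shape[0])
    return C @ V @ C.T

def f_dct_inv(Vf: np.ndarray) -> np.ndarray:
    """Inverse 2-D DCT: C^T Vf C (C is orthogonal, so this undoes f_dct)."""
    C = dct_matrix(Vf.shape[0])
    return C.T @ Vf @ C

# A smooth 5 x 5 filter (a Gaussian bump) concentrates its energy at low frequencies.
g = np.exp(-0.5 * (np.arange(5) - 2.0) ** 2)
V = np.outer(g, g)
Vf = f_dct(V)
print(np.round(Vf, 3))
print(np.allclose(f_dct_inv(Vf), V))
```

Because the basis matrix is orthogonal, the inverse is simply the transpose, which is why no information is lost without quantization.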


3 Frequency-Sensitive Hashed Nets 


Here we present FreshNets, a method for using weight sharing to reduce the model size (and memory
demands) of convolutional neural networks. Similar to the work of Chen et al. [?], we achieve
smaller models by randomly forcing weights throughout the network to share identical values.
Unlike previous work, we implement the weight sharing and gradient updates of convolutional
filters in the frequency domain. These sharing constraints are made prior to training, and we learn
frequency weights under the sharing assignments. Since the assignments are made with a hash
function, they incur no additional storage.

Filters in spatial and frequency domain. Let the matrix $V^{k\ell} \in \mathbb{R}^{d \times d}$ denote
the weight matrix of the $d \times d$ convolutional filter that connects the $k$-th input plane to
the $\ell$-th output plane. (For notational convenience we assume square filters and only consider
the filters in a single layer of the network.) The weights of all filters in a convolutional layer can
be denoted by a 4-dimensional tensor $V \in \mathbb{R}^{m \times n \times d \times d}$ where $m$ and $n$
are the number of input planes and output planes, respectively, resulting in a total of
$m \times n \times d^2$ parameters. Convolutional filters can be represented equivalently in either the
spatial or frequency domain, mapping between the two via the DCT and its inverse. We denote the
filter in frequency domain as $\mathcal{V}^{k\ell} = f_{dct}(V^{k\ell}) \in \mathbb{R}^{d \times d}$ and
recover the original spatial representation through $V^{k\ell} = f^{-1}_{dct}(\mathcal{V}^{k\ell})$, as
defined in Eq. (1) and (2), respectively. The tensor of all filters is denoted
$\mathcal{V} \in \mathbb{R}^{m \times n \times d \times d}$.

Figure 1: A schematic illustration of FreshNets. Two spatial filters are reconstructed from the
frequency weights in vector $w$. The frequency weights are accessed with two hash functions and
then transformed to the spatial domain. The vector $w$ is partitioned into sub-vectors $w^j$ shared
by all entries with similar frequency (corresponding to index sum $j = j_1 + j_2$). Colors indicate
which hash bucket was accessed.


Random Weight Sharing by Hashing. We would like to reduce the number of model parameters
to exactly $K$ values stored in a weight vector $w \in \mathbb{R}^K$, where $K \ll m \times n \times d^2$. To achieve this, we
randomly assign a value from $w$ to each filter frequency weight in $\mathcal{V}$. A naive implementation of this
random weight sharing would introduce an auxiliary matrix for $\mathcal{V}$ to track the weight assignments,
leading to significant additional memory. To address this problem, Chen et al. [?] advocate use of the
hashing trick to (pseudo-)randomly assign shared parameters. Using the hashing trick, we tie each
filter weight $\mathcal{V}^{k\ell}_{j_1 j_2}$ to an element of $w$ indexed by the output of a hash function $h(\cdot)$:

$$\mathcal{V}^{k\ell}_{j_1 j_2} = \xi(k, \ell, j_1, j_2)\, w_{h(k, \ell, j_1, j_2)}, \tag{3}$$

where $h(k, \ell, j_1, j_2) \in \{1, \dots, K\}$, and $\xi(k, \ell, j_1, j_2) \in \{\pm 1\}$ is a sign factor computed by a second
hash function $\xi(\cdot)$ to preserve inner-products in expectation as described in Section 2. With the
mapping in Eq. (3), we can implement shared parameter assignments with no additional storage
cost. (For a schematic illustration, see Figure 1. The figure also incorporates a frequency sensitive
hashing scheme discussed later in this section.)
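To make the mapping from the shared weight vector to spatial filters concrete, the following sketch rebuilds every filter of a layer from $K$ stored values, looking each frequency entry up through a hash and then applying the inverse DCT. It is an illustration under assumptions: the MD5-based `_hash_int` helper, the function names, and the 0-based bucket indices are hypothetical choices, not the authors' implementation.

```python
import hashlib
import numpy as np

def _hash_int(key: tuple, seed: str) -> int:
    """Deterministic integer hash of a key tuple (illustrative choice of hash)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def dct_matrix(d: int) -> np.ndarray:
    """Orthonormal DCT-II basis: C[j, i] = s_j * cos(pi * (2i + 1) * j / (2d))."""
    i = np.arange(d)
    j = np.arange(d)[:, None]
    s = np.full(d, np.sqrt(2.0 / d))
    s[0] = np.sqrt(1.0 / d)
    return s[:, None] * np.cos(np.pi * (2 * i + 1) * j / (2 * d))

def build_filters(w: np.ndarray, m: int, n: int, d: int) -> np.ndarray:
    """Return spatial filters V[k, l] = f_dct^{-1}(Vf[k, l]), Vf[k, l, j1, j2] = xi * w[h]."""
    K = len(w)
    C = dct_matrix(d)
    V = np.empty((m, n, d, d))
    for k in range(m):
        for l in range(n):
            Vf = np.empty((d, d))
            for j1 in range(d):
                for j2 in range(d):
                    key = (k, l, j1, j2)
                    h = _hash_int(key, "h") % K                        # bucket index
                    xi = 1.0 if _hash_int(key, "xi") % 2 == 0 else -1.0  # sign factor
                    Vf[j1, j2] = xi * w[h]
            V[k, l] = C.T @ Vf @ C                                     # inverse DCT
    return V

w = np.random.default_rng(1).normal(size=20)
V = build_filters(w, m=3, n=4, d=5)
print(V.size, "virtual weights backed by", w.size, "stored values")
```

Only `w` needs to be stored; the assignment of entries to buckets is recomputed from the hash whenever the filters are needed.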

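Gradients over Shared Frequency Weights. Typical convolutional neural networks learn filters in
the spatial domain. As our shared weights are stored in the frequency domain, we derive the gradient
with respect to filter parameters in frequency space. Following Eq. (2), we express the gradient of
parameters in the spatial domain w.r.t. their counterparts in the frequency domain:

$$\frac{\partial V^{k\ell}_{i_1 i_2}}{\partial \mathcal{V}^{k\ell}_{j_1 j_2}} = s_{j_1} s_{j_2}\, c(i_1, i_2, j_1, j_2). \tag{4}$$

Let $\mathcal{L}$ be the loss function adopted for training. Using standard back-propagation, we can derive the
gradient w.r.t. filter parameters in the spatial domain, $\partial \mathcal{L} / \partial V^{k\ell}_{i_1 i_2}$. By the chain rule with Eq. (4), we
express the gradient of $\mathcal{L}$ in the frequency domain:

$$\frac{\partial \mathcal{L}}{\partial \mathcal{V}^{k\ell}_{j_1 j_2}}
= \sum_{i_1=0}^{d-1} \sum_{i_2=0}^{d-1} \frac{\partial \mathcal{L}}{\partial V^{k\ell}_{i_1 i_2}} \frac{\partial V^{k\ell}_{i_1 i_2}}{\partial \mathcal{V}^{k\ell}_{j_1 j_2}}
= s_{j_1} s_{j_2} \sum_{i_1=0}^{d-1} \sum_{i_2=0}^{d-1} c(i_1, i_2, j_1, j_2)\, \frac{\partial \mathcal{L}}{\partial V^{k\ell}_{i_1 i_2}}. \tag{5}$$

Comparing with Eq. (1), we see that the gradient in the frequency domain is merely the DCT of the
gradient in the spatial domain:

$$\frac{\partial \mathcal{L}}{\partial \mathcal{V}^{k\ell}} = f_{dct}\left(\frac{\partial \mathcal{L}}{\partial V^{k\ell}}\right).$$

We compute the gradient for each shared weight $w_h$ by simply summing over the gradient at each filter
parameter where the weight is assigned, i.e. all $\mathcal{V}^{k\ell}_{j_1 j_2}$ where $h = h(k, \ell, j_1, j_2)$:

$$\frac{\partial \mathcal{L}}{\partial w_h} = \sum_{(k, \ell, j_1, j_2)\,:\,h(k, \ell, j_1, j_2) = h} \xi(k, \ell, j_1, j_2) \left[ f_{dct}\left(\frac{\partial \mathcal{L}}{\partial V^{k\ell}}\right) \right]_{j_1 j_2},$$

where $[A]_{j_1 j_2}$ denotes the $(j_1, j_2)$ entry in matrix $A$.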
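The two steps of this derivation (take the DCT of the spatial gradient, then sum signed gradients per bucket) can be checked numerically. The sketch below is a toy setup, not the training code of the paper: the bucket indices `idx` and signs `sign` are drawn from a random generator as a stand-in for the two hash functions, the quadratic loss against a random `target` is an arbitrary choice, and all function names are hypothetical.

```python
import numpy as np

def dct_matrix(d: int) -> np.ndarray:
    """Orthonormal DCT-II basis: C[j, i] = s_j * cos(pi * (2i + 1) * j / (2d))."""
    i = np.arange(d)
    j = np.arange(d)[:, None]
    s = np.full(d, np.sqrt(2.0 / d))
    s[0] = np.sqrt(1.0 / d)
    return s[:, None] * np.cos(np.pi * (2 * i + 1) * j / (2 * d))

def filters_from_weights(w, idx, sign, C):
    """Eq. (3) for every (k, l, j1, j2), followed by the inverse DCT per filter."""
    Vf = sign * w[idx]
    return C.T @ Vf @ C          # matmul broadcasts over the leading (k, l) axes

def bucket_gradient(dL_dV, idx, sign, C, K):
    """DCT of the spatial gradient, then a signed sum over each hash bucket."""
    dL_dVf = C @ dL_dV @ C.T
    grad = np.zeros(K)
    np.add.at(grad, idx, sign * dL_dVf)   # unbuffered add handles repeated buckets
    return grad

rng = np.random.default_rng(0)
m, n, d, K = 2, 3, 5, 12
C = dct_matrix(d)
idx = rng.integers(0, K, size=(m, n, d, d))         # stand-in for h(k, l, j1, j2)
sign = rng.choice([-1.0, 1.0], size=(m, n, d, d))   # stand-in for xi(k, l, j1, j2)
w = rng.normal(size=K)
target = rng.normal(size=(m, n, d, d))

def loss(w_vec):
    """Toy quadratic loss on the reconstructed spatial filters."""
    return 0.5 * np.sum((filters_from_weights(w_vec, idx, sign, C) - target) ** 2)

dL_dV = filters_from_weights(w, idx, sign, C) - target   # spatial gradient of the toy loss
grad = bucket_gradient(dL_dV, idx, sign, C, K)
print(grad)
```

In a real network the spatial gradient would come from back-propagation through the convolution; only the last two steps in `bucket_gradient` are specific to the hashed frequency representation.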

Frequency Sensitive Hashing. Figure 2 shows a filter in spatial (left) and frequency (right) domains.
In the spatial domain CNN filters are smooth [?] due to the local pixel smoothness in natural
images. In the frequency domain this corresponds to components with large magnitudes in the low
frequencies, depicted in the upper left half of $\mathcal{V}^{k\ell}$ in Figure 2. Correspondingly, the high
frequencies, in the bottom right half of $\mathcal{V}^{k\ell}$, have magnitudes near zero.

Figure 2: An example of a filter in spatial (left) and frequency domain (right).

As components of different frequency groups tend to be of different magnitudes (and thereby varying
importance to the spatial structure of the filter), we want to avoid collisions between high and low
frequency components. Therefore, we assign separate hash spaces to different frequency groups. In
particular, we partition the $K$ values of $w$ into sub-vectors $w^0, \dots, w^{2d-2}$ of sizes $K_0, \dots, K_{2d-2}$,
where $\sum_j K_j = K$. This partitioning allows parameters with the same frequency, corresponding to
their index sum $j = j_1 + j_2$, to be hashed into a corresponding dedicated hash space $w^j$. We rewrite
Eq. (3) with the new frequency sensitive shared weight assignments:

$$\mathcal{V}^{k\ell}_{j_1 j_2} = \xi(k, \ell, j_1, j_2)\, w^{j}_{h^j(k, \ell, j_1, j_2)},$$

where $h^j(\cdot)$ maps an input key to a natural number in $\{1, \dots, K_j\}$ and $j = j_1 + j_2$.

We define a compression rate $r_j \in (0, 1]$ for each frequency region $j$ and assign $K_j = r_j N_j$. A
smaller $r_j$ induces more collisions during hashing, leading to increased weight sharing. Since lower
frequency components tend to be of higher importance, making collisions more hurtful, we commonly
assign larger $r_j$ (fewer collisions) to low-frequency regions. Intuitively, given a size budget
for the whole convolutional layer, we want to squeeze the hash space of high frequency regions to
save space for low frequency regions. These compression rates can either be assigned by hand or
determined programmatically by cross-validation, as demonstrated in Section 5.
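A rough sketch of the partitioned lookup follows. It rests on one interpretation that the text does not spell out: $N_j$ is taken to be the number of frequency entries in the layer whose index sum is $j$. The helper names, the MD5-based hash, the per-region seeds, the floor of one bucket per region, and the example rates are all hypothetical.

```python
import hashlib
import numpy as np

def _hash_int(key: tuple, seed: str) -> int:
    """Deterministic integer hash of a key tuple (illustrative choice of hash)."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def bucket_sizes(m: int, n: int, d: int, rates) -> list:
    """K_j = r_j * N_j, with N_j assumed to count entries with j1 + j2 = j in the layer."""
    sizes = []
    for j, r in enumerate(rates):
        per_filter = sum(1 for j1 in range(d) for j2 in range(d) if j1 + j2 == j)
        N_j = m * n * per_filter
        sizes.append(max(1, int(round(r * N_j))))   # keep at least one bucket per region
    return sizes

def lookup(w_parts, k: int, l: int, j1: int, j2: int) -> float:
    """Frequency weight xi * w^j[h^j(k, l, j1, j2)] with j = j1 + j2 (0-based buckets)."""
    j = j1 + j2
    key = (k, l, j1, j2)
    h = _hash_int(key, f"h{j}") % len(w_parts[j])   # separate hash space per region
    xi = 1.0 if _hash_int(key, "xi") % 2 == 0 else -1.0
    return xi * w_parts[j][h]

m, n, d = 32, 64, 5
rates = np.linspace(1.0, 0.1, 2 * d - 1)            # larger r_j for low frequencies
sizes = bucket_sizes(m, n, d, rates)
rng = np.random.default_rng(0)
w_parts = [rng.normal(size=K_j) for K_j in sizes]
print(sizes, sum(sizes), "stored values for", m * n * d * d, "virtual weights")
print(lookup(w_parts, 0, 0, 0, 0), lookup(w_parts, 0, 0, 4, 4))
```

Because each region has its own sub-vector, a low-frequency entry can never collide with a high-frequency one, whatever the rates are.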

4 Related Work 

Several recent studies have confirmed that there is significant redundancy in the parameters learned
in deep neural networks. Recent work by Denil et al. [?] learns parameters in fully-connected layers
after decomposition into two low-rank matrices, i.e. $W = AB$ where $W \in \mathbb{R}^{m \times n}$,
$A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$. In this way, the original $O(mn)$
parameters could be stored with $O(k(m + n))$ storage, where $k \ll \min(m, n)$. Several works apply
related approaches to speed up the evaluation time with convolutional neural networks. Two works
propose to approximate convolutional filters by a weighted linear combination of basis filters [?].
In this setting, the convolution operation only needs to be performed with the small set of basis
filters. The desired output feature maps are computed by matrix multiplication as the weighted sum
of these basis convolutions. Further speedup can be achieved by learning rank-one basis filters so
that the convolution operations are very cheap to compute [?, 19]. Based on this idea, Denton et
al. [?] advocate decomposing the four-dimensional tensor of the filter weights into a sum of
different rank-one, four-dimensional tensors. In addition, they adopt bi-clustering to group filters
such that each subgroup can be better approximated by rank-one tensors.
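The storage argument for the factorization $W = AB$ can be illustrated with a small count. The sketch uses a truncated SVD only as a convenient way to obtain some rank-$k$ factors; that is a swapped-in technique for illustration, since the cited work learns the factors rather than truncating an SVD. The sizes `m`, `n`, `k` are arbitrary example values.

```python
import numpy as np

m, n, k = 256, 512, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))

# Rank-k factors from a truncated SVD (illustrative stand-in for learned factors).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]        # m x k
B = Vt[:k]                  # k x n

full_params = m * n                 # O(mn)
low_rank_params = k * (m + n)       # O(k(m + n))
print(full_params, low_rank_params, full_params / low_rank_params)
```

With these example sizes the factorized form stores 12288 values instead of 131072, at the cost of restricting $W$ to rank $k$.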

In each of these works, evaluation time is the main focus, with any resulting storage reduction
achieved merely as a side effect. Other works focus entirely on compressing the fully-connected
layers of CNNs [?]. However, with the trend toward architectures with fewer fully connected
layers and additional convolutional layers [?], compression of filters is of increased importance.
Another technique for speeding up convolutional neural network evaluation is computing convolutions
in the Fourier frequency domain, as convolution in the spatial domain is equivalent to (comparatively
lower-cost) element-wise multiplication in the frequency domain [21, 28]. Unlike FreshNets,
for a filter of size d x d and an image of size n x n where n > d, Mathieu et al. [?] convert the
filter to its frequency domain of size n x n by oversampling the frequencies, which is necessary
for doing element-wise multiplication with a larger image but also increases the memory overhead
at test time. Training in the Fourier frequency domain may be advantageous for similar reasons,
particularly when convolutions are being performed over large 3-D volumes [?].

Most relevant to this work is HashedNets [?], which compresses the fully connected layers of deep
neural networks. This method uses the hashing trick to efficiently implement parameter sharing prior
to learning, achieving notable compression with less loss of accuracy than the competing baselines
which relied on low-rank decomposition or learning in randomly sparse architectures.

5 Experimental Results 

In this section, we conduct several comprehensive experiments on benchmark datasets to evaluate 
the performance of FreshNets. 

Datasets. We experiment with eight benchmark datasets: CIFAR10, CIFAR100, SVHN and five
challenging variants of MNIST. The CIFAR10 dataset contains 60000 images of 32 x 32 pixels with
three color channels. Images are selected from ten classes with each class consisting of 6000 unique
instances. The CIFAR100 dataset also contains 60000 32 x 32 images, but is more challenging since
the images are selected from 100 classes (each class has 600 images). For both CIFAR datasets,
50000 images are designated for training and the remaining 10000 images for testing. To improve
accuracy on CIFAR100, we augment by horizontal reflection and cropping [?], resulting in 0.8M
training images. The SVHN dataset is a large collection of digits (10 classes) cropped from real-
world scenes, consisting of 73257 training images, 26032 testing images and 531131 less difficult

Layer  Operation    Input dim.  Inputs  Outputs  C size  MP size  Parameters
1      C,RL         32x32       3       32       5x5     -        2K
2      C,MP,DO,RL   32x32       32      64       5x5     2x2(2)   51K
3      C,RL         16x16       64      64       5x5     -        102K
4      C,MP,DO,RL   16x16       64      128      5x5     2x2(2)   205K
5      C,MP,DO,RL   8x8         128     256      5x5     2x2(2)   819K
6      FC,Softmax   -           4096    10/100   -       -        40/400K


Table 1: Network architecture. C: Convolution. RL: ReLu. MP: Max-pooling. DO: Dropout. FC: 
Fully-connected. The number of parameters in the fully-connected layer is specific to 32 x 32 input 
images and varies with the number of classes, either 10 or 100 depending on the dataset. 
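The parameter column of Table 1 can be reproduced with a short calculation. The sketch assumes that bias terms are not counted, which is consistent with the rounded values in the table; the variable names are illustrative.

```python
# (inputs, outputs) of the five convolutional layers in Table 1, all with 5x5 filters.
conv_layers = [(3, 32), (32, 64), (64, 64), (64, 128), (128, 256)]
d = 5

# m x n x d^2 weights per convolutional layer (biases ignored, an assumption).
conv_params = [m * n * d * d for m, n in conv_layers]

# Three 2x2 max-pooling steps (stride 2) reduce 32x32 maps to 4x4, with 256 output maps.
fc_inputs = 256 * 4 * 4
fc_params = {classes: fc_inputs * classes for classes in (10, 100)}

print(conv_params, sum(conv_params))
print(fc_inputs, fc_params)
```

The convolutional layers sum to about 1.2 million weights, against roughly 40 thousand in the fully connected layer for 10 classes.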



           (a) Compression = 1/16                                     (b) Compression = 1/64
Dataset    CNN    DropFilt  DropFreq  LRD    HashedNets  FreshNets    CNN    LRD    HashedNets  FreshNets
CIFAR10    14.91  54.87     30.45     23.23  24.70       21.42        14.37  34.35  43.08       30.79
CIFAR100   33.66  81.17     55.93     51.88  48.64       47.49        33.76  66.44  67.06       62.33
SVHN       3.71   30.93     14.96     10.67  9.00        8.01         3.69   22.32  23.31       18.37
MNIST-07   0.80   4.90      2.20      1.18   1.10        0.94         0.85   1.95   1.77        1.24
ROT        3.42   29.74     8.39      4.79   5.53        3.87         3.32   9.90   10.10       6.60
BG-ROT     11.42  88.88     56.63     20.19  16.15       18.43        11.28  35.64  32.40       27.91
BG-RAND    2.17   90.10     8.83      2.94   2.80        2.63         1.77   4.57   5.10        3.62
BG-IMG     2.61   89.41     27.89     4.35   3.26        3.97         2.38   7.23   6.68        8.04


Table 2: Test error rates (in %) with compression factors 1/16 and 1/64. Convolutional layers were 
compressed by the indicated methods (DropFilt, DropFreq, LRD, HashedNets, and FreshNets), with 
no convolutional layer compression applied to CNN. The fully connected layer is compressed by 
HashedNets for all methods, including CNN.


images for additional training. In our experiments, we use all available training images, for a total
of 604388 training samples. For the MNIST variants [?], each variation either reduces the training
size (MNIST-07) or amends the original digits by rotation (ROT), background superimposition
(BG-RAND and BG-IMG), or a combination thereof (BG-ROT). We preprocess all datasets with whitening
(except CIFAR100 and SVHN which were prohibitively large).

Baselines. We compare the proposed FreshNets with four baseline methods: HashedNets [?],
low-rank decomposition (LRD) [?], filter dropping (DropFilt) and frequency dropping (DropFreq).
HashedNets was first proposed to compress fully-connected layers in deep neural networks via the
hashing trick. In this baseline, we apply the hashing trick directly to the convolutional layer by
hashing filter weights in the spatial domain. This induces random weight sharing across all filters in
a single convolutional layer. Additionally, we compare against low-rank decomposition of the
convolutional filters [?]. Following the method in [?], we unfold the four-dimensional filter tensor
to form a two dimensional matrix on which we apply the low-rank decomposition. The parameters
of the decomposition are fine-tuned via back-propagation. DropFreq learns parameters in the DCT
frequency domain but sets high frequency components to 0 to meet the compression requirement.
DropFilt compresses simply by reducing the number of filters in each convolutional layer.

All methods were implemented using Torch7 [6] and run on NVIDIA GTX TITAN graphics cards
with 2688 cores and 6GB of global memory. Model parameters are stored and updated as 32 bit
floating-point values.²

Comprehensive evaluation. We adopt the network architecture shown in Table 1 for all
datasets. The architecture is a deep convolutional neural network consisting of five convolutional 
layers (with 5x5 filters) and one fully-connected layer. Before convolution, input feature maps 
are zero-padded such that output maps remain the same size as the (un-padded) input maps after 
convolution. Max-pooling is performed after convolutions in layers 2, 4 and 5 with filter size 2x2 
and stride 2, reducing both input map dimensions by half. Rectified linear units are adopted as the 
activation function throughout. The output of the network is a softmax function over labels. 


In this architecture, the convolutional layers hold the majority of parameters (1.2 million in
convolutional layers vs. 40 thousand in the fully connected layer with 10 output classes). During
training, we optimize parameters using mini-batch gradient descent with batch size 64 and momentum
0.9. We use 20 percent of the training set as a validation set for early stopping. For FreshNets,
we use a frequency-sensitive compression scheme which increases weight sharing among higher
frequency components.³ For all baselines, we apply HashedNets [?] to the fully connected layer at
the corresponding level of compression. All error results are reported on the test set.

² The compression rates of all methods could be further improved by learning and storing parameters in
lower precision [?].

Table 2(a) and (b) show the comprehensive evaluation of all methods under compression ratios 1/16
and 1/64, respectively. We exclude DropFilt and DropFreq in Table 2(b) because neither supports
1/64 compression in this architecture for all layers. For all methods, the fully connected layer (top
layer) is compressed by HashedNets [?] at the corresponding compression rate. In this way, the
final size of the entire network respects the specified compression ratio. For reference, we also 
show the error rate of a standard convolutional neural network (CNN, columns 2 and 8) with the 
fully-connected layer compressed by HashedNets and no compression in the convolutional layers. 
Excluding this reference, we highlight the method with best test error on each dataset in bold. 



Figure 3: Test error rates at varying compression levels for 
datasets CIFAR10 (left) and ROT (right).


We discern several general trends. In Table 2(a), we observe the performance of DropFilt and
DropFreq at 1/16 compression. At this compression rate, DropFilt corresponds to a network with
1/16 of the filters at each layer: 2, 4, 4, 8, 16 at layers 1-5 respectively. This architecture
yields particularly poor test accuracy, including essentially random predictions on three
datasets. DropFreq, which at 1/16 compression parameterizes each filter in the original
network by only 1 or 2 low-frequency values in the DCT frequency space, performs with similarly
poor accuracy. Low rank decomposition (LRD) and HashedNets each yield similar performance
at both 1/16 and 1/64 compression. Neither explicitly considers the smoothness inherent
in learned convolutional filters, instead compressing the filters in the spatial domain.
Our method, FreshNets, consistently outperforms all baselines, particularly at the higher
compression rate as shown in Table 2(b). Using the same model as in Table 1, Figure 3 shows
more complete curves of test errors with multiple compression factors on the CIFAR10 and
ROT datasets.



Figure 4: Results with different frequency sensitive compression schemes, each adopting a different
beta distribution as the compression rate for each frequency. The inner figure shows normalized
test error of each scheme on CIFAR10 with the beta distribution hyper-parameters. The outer
figure depicts the five beta distributions (with colors matching the inner figure; axis label:
Frequency Partition Index).


³ We evaluate several frequency-sensitive schemes later in this section, but for this comprehensive evaluation
we set frequency compression rates by a rescaled beta distribution with $\alpha = 0.25$ and $\beta = 2.5$ for all layers.

Figure 5: Visualization of filters learned on MNIST in (a) an uncompressed CNN, (b) a CNN compressed
with FreshNets, and (c) a CNN compressed with HashedNets (compression rate 1/16 in
both (b) and (c)). FreshNets preserves the smoothness of the filters, whereas HashedNets does not.

Varying compression by frequency. As mentioned in Section 3, we allow a higher collision rate
in the high frequency components than in the low frequency components for each filter. To demonstrate
the utility of this scheme, we evaluate several hash compression schemes. Systematically, we
set the compression rate $r_j$ of the $j$-th frequency band with a parameterized function, i.e. $r_j = f(j)$.
In this experiment, we use the beta distribution: $f(j; \alpha, \beta) = \frac{1}{Z} x^{\alpha - 1} (1 - x)^{\beta - 1}$,
where $x$ is a real number between 0 and 1 obtained by rescaling the frequency index $j$ with the
filter size $k$, and $Z$ is a normalizing factor such that the resulting distribution of parameters
meets the target parameter budget $K$, i.e. $\sum_j r_j N_j = K$.

We adjust $\alpha$ and $\beta$ to control the compression rate for each frequency region. As shown in Figure 4,
we have multiple pairs of $\alpha$ and $\beta$, each of which results in a different compression scheme. For
example, if $\alpha = 0.25$ and $\beta = 2.5$, the compression rate monotonically decreases as a function of
component frequency, meaning more parameter sharing among high frequency components (blue
curve in Figure 4).
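The effect of the two hyper-parameters can be explored with a small sketch that evaluates the beta-shaped rates and rescales them to a parameter budget. Two things here are assumptions rather than details from the text: the rescaling `x = (j + 1) / (2 * filter_size)` of the frequency index into the open interval (0, 1) is a hypothetical choice, and $N_j$ is counted per filter as the number of entries with index sum $j$. The sketch also does not clip rates to at most 1, which a full implementation would need for mild compression levels.

```python
import numpy as np

def beta_rates(filter_size: int, alpha: float, beta: float, budget_per_filter: float):
    """Return (r, N): rates r_j proportional to x^(alpha-1) (1-x)^(beta-1), scaled so sum_j r_j N_j = budget."""
    j = np.arange(2 * filter_size - 1)
    x = (j + 1) / (2 * filter_size)                 # hypothetical rescaling into (0, 1)
    f = x ** (alpha - 1) * (1 - x) ** (beta - 1)
    # Entries per filter with j1 + j2 = j (assumed meaning of N_j).
    N = np.minimum(j, 2 * (filter_size - 1) - j) + 1
    Z = (f * N).sum() / budget_per_filter           # normalizer to meet the budget
    return f / Z, N

# 5x5 filters compressed to 1/16 of their 25 frequency weights per filter.
r, N = beta_rates(5, alpha=0.25, beta=2.5, budget_per_filter=25 / 16)
print(np.round(r, 4))
print((r * N).sum())

# alpha = beta = 1 gives a uniform, frequency-oblivious rate.
r_flat, _ = beta_rates(5, alpha=1.0, beta=1.0, budget_per_filter=25 / 16)
print(np.round(r_flat, 4))
```

Under these assumptions the first setting yields rates that fall monotonically with frequency, while the second assigns the same rate to every frequency region.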

To quickly evaluate the performance of each scheme, we use a simple four-layer FreshNets where 
the first two layers are DCT-hashed convolutional layers (with 5x5 filters) containing 32 and 64 
feature maps respectively, and the last two layers are fully connected layers. We test FreshNets 
on CIFAR10 with each of the compression schemes shown in Figure 4. In each, weight sharing
is limited to be within groups of similar frequencies, as described in Section 3; however, the number
of unique weights shared within each group is varied. We denote the compression scheme with
$\alpha = \beta = 1$ (red curve) as a frequency-oblivious scheme since it produces a uniform compression
independent of frequency. In the inset bar plot in Figure 4, we report test error normalized by the
test error of the frequency-oblivious scheme and averaged over compression rates 1, 1/2, 1/4, 1/16,
1/64, and 1/256. We can see that the proposed scheme with fewer shared weights allocated to high 
frequency components (represented by the blue curve) outperforms all other compression schemes. 
An inverse scheme where the high frequency regions have the lowest collision rate (purple curve) 
performs the worst. These empirical results fit our assumption that the low frequency components 
of a filter are more important than the high frequency components. 


Filter visualization. We investigate the smoothness of the learned convolutional filters in Figure 5
by visualizing the filter weights (first layer) of (a) a standard, uncompressed CNN, (b) FreshNets, 
and (c) HashedNets (with weight sharing in the spatial domain). For this experiment, we again 
apply a four layer network with two convolutional layers but adopt larger filters (11 x 11) for better 
visualization. All three networks are trained on MNIST, and both FreshNets and HashedNets have 
1/16 compression on the first convolutional layer. When plotting, we scale the values in each filter 
matrix to the range [0, 255]. Hence, white and black pixels stand for large positive and negative 
weights, respectively. We observe that, although more blurry due to the compression, the filter 
weights of FreshNets are still smooth while weights in HashedNets appear more chaotic. 


6 Conclusion 

In this paper we present FreshNets, a method for learning convolutional neural networks with
dramatically compressed model storage. Harnessing the hashing trick for parameter-free random weight
sharing and leveraging the smoothness inherent in convolutional filters, FreshNets compresses
parameters in a frequency-sensitive fashion such that significant model parameters (e.g. low-frequency
components) are better preserved. As such, FreshNets preserves prediction accuracy significantly
better than competing baselines at high compression rates.